{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Metadata\n",
    "### Tracking and managing metadata of machine learning workflows in Kubeflow"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "The goal of the [Metadata](https://github.com/kubeflow/metadata) project is to help Kubeflow users understand and manage their machine learning workflows by tracking and managing the metadata of workflows.\n",
    "\n",
    "\n",
    "Metadata comes with three components. From Kubeflow v0.6, Metadata is installed by default.\n",
    "\n",
    "- UI\n",
    "- Backend Store\n",
    "- Python SDK\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Core Concepts\n",
    "\n",
    "- _Run_ describes an execution of a machine learning workflow, which can be a pipeline or a notebook.\n",
    "- _Artifact_ describes derived data used or produced in a run.\n",
    "- _Execution_ describes an execution of a single step of a run with its running parameters.\n",
    "- _Workspace_ groups a set of runs and related artifacts and executions.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Install Python SDK"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To use the latest publish `kfmd` library, you can run:\n",
    "!pip install kfmd==0.1.8 --user\n",
    "\n",
    "# Install other packages used in the turorial:\n",
    "!pip install pandas==0.24.2 --user"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Restart the kernel to pick up pip installed libraries\n",
    "from IPython.core.display import HTML\n",
    "HTML(\"<script>Jupyter.notebook.kernel.restart()</script>\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Verify Installation\n",
    "from kfmd import metadata\n",
    "import pandas\n",
    "from datetime import datetime"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic Python SDK Usage\n",
    "\n",
    "Please follow commands here to understand basic usage of metadata SDK"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a workspace"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_workspace = metadata.Workspace(\n",
    "    # Connect to metadata-service in namesapce kubeflow in k8s cluster.\n",
    "    backend_url_prefix=\"metadata-service.kubeflow:8080\",\n",
    "    name=\"test_workspace\",\n",
    "    description=\"a workspace for testing\",\n",
    "    labels={\"foo\": \"bar\"})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a run in a workspace"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_run = metadata.Run(\n",
    "    workspace=test_workspace,\n",
    "    name=\"run-\" + datetime.utcnow().isoformat(\"T\") ,\n",
    "    description=\"a run in workspace\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create an execution in a run"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "exec = metadata.Execution(\n",
    "    name = \"execution\" + datetime.utcnow().isoformat(\"T\") ,\n",
    "    workspace=test_workspace,\n",
    "    run=test_run,\n",
    "    description=\"execution example\",\n",
    ")\n",
    "print(\"An execution is create with id %s\" % exec.id)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Log a data set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_set = exec.log_input(\n",
    "        metadata.DataSet(\n",
    "            description=\"Training datasets\",\n",
    "            name=\"imagenet\",\n",
    "            owner=\"someone@kubeflow.org\",\n",
    "            uri=\"s3://path/to/dataset\",\n",
    "            version=\"v1.0.0\",\n",
    "            query=\"SELECT * FROM mytable\"))\n",
    "assert data_set.id\n",
    "print(\"data set id is %s\" % data_set.id)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Log a model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = exec.log_output(\n",
    "    metadata.Model(\n",
    "            name=\"MNIST\",\n",
    "            description=\"model to recognize handwritten digits\",\n",
    "            owner=\"someone@kubeflow.org\",\n",
    "            uri=\"s3://my-bucket/mnist\",\n",
    "            model_type=\"neural network\",\n",
    "            training_framework={\n",
    "                \"name\": \"tensorflow\",\n",
    "                \"version\": \"v1.0\"\n",
    "            },\n",
    "            hyperparameters={\n",
    "                \"learning_rate\": 0.5,\n",
    "                \"layers\": [10, 3, 1],\n",
    "                \"early_stop\": True\n",
    "            },\n",
    "            version=\"v0.0.1\",\n",
    "            labels={\"mylabel\": \"l1\"}))\n",
    "assert model.id\n",
    "print(\"model id is %s\" % model.id)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Log an evaluation(metrics) of a model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "metrics = exec.log_output(\n",
    "    metadata.Metrics(\n",
    "            name=\"MNIST-evaluation\",\n",
    "            description=\"validating the MNIST model to recognize handwritten digits\",\n",
    "            owner=\"someone@kubeflow.org\",\n",
    "            uri=\"s3://my-bucket/mnist-eval.csv\",\n",
    "            data_set_id=data_set.id,\n",
    "            model_id=model.id,\n",
    "            metrics_type=metadata.Metrics.VALIDATION,\n",
    "            values={\"accuracy\": 0.95},\n",
    "            labels={\"mylabel\": \"l1\"}))\n",
    "assert metrics.id\n",
    "print(\"metrics id is %s\" % model.id)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### List all models in the workspace"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pandas.DataFrame.from_dict(test_workspace.list(metadata.Model.ARTIFACT_TYPE_NAME))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Basic Lineage Tracking"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"model id is %s\\n\" % model.id)\n",
    "    \n",
    "# Find the execution that produces this model.\n",
    "output_events = test_workspace.client.list_events2(model.id).events\n",
    "assert len(output_events) == 1\n",
    "execution_id = output_events[0].execution_id\n",
    "\n",
    "# Find all events related to that execution.\n",
    "all_events = test_workspace.client.list_events(execution_id).events\n",
    "assert len(all_events) == 3\n",
    "\n",
    "print(\"\\nAll events related to this model:\")\n",
    "pandas.DataFrame.from_dict([e.to_dict() for e in all_events])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Real world example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "from tensorflow import keras\n",
    "\n",
    "# Helper libraries\n",
    "import numpy as np\n",
    "import os\n",
    "import subprocess\n",
    "import argparse\n",
    "import time\n",
    "\n",
    "from kfmd import metadata\n",
    "\n",
    "\n",
    "# Reduce spam logs from s3 client\n",
    "os.environ['TF_CPP_MIN_LOG_LEVEL']='3'\n",
    "\n",
    "def preprocessing(mnist_execution):\n",
    "  fashion_mnist = keras.datasets.fashion_mnist\n",
    "  (train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()\n",
    "\n",
    "  # scale the values to 0.0 to 1.0\n",
    "  train_images = train_images / 255.0\n",
    "  test_images = test_images / 255.0\n",
    "\n",
    "  # reshape for feeding into the model\n",
    "  train_images = train_images.reshape(train_images.shape[0], 28, 28, 1)\n",
    "  test_images = test_images.reshape(test_images.shape[0], 28, 28, 1)\n",
    "\n",
    "  class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',\n",
    "                'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']\n",
    "\n",
    "  print('\\ntrain_images.shape: {}, of {}'.format(train_images.shape, train_images.dtype))\n",
    "  print('test_images.shape: {}, of {}'.format(test_images.shape, test_images.dtype))\n",
    "\n",
    "  return train_images, train_labels, test_images, test_labels\n",
    "\n",
    "def train(train_images, train_labels, epochs, model_summary_path=None, mnist_execution=None):\n",
    "  if model_summary_path:\n",
    "    logdir=model_summary_path # + datetime.now().strftime(\"%Y%m%d-%H%M%S\")\n",
    "    tensorboard_callback = keras.callbacks.TensorBoard(log_dir=logdir)\n",
    "\n",
    "  model = keras.Sequential([\n",
    "    keras.layers.Conv2D(input_shape=(28,28,1), filters=8, kernel_size=3,\n",
    "                        strides=2, activation='relu', name='Conv1'),\n",
    "    keras.layers.Flatten(),\n",
    "    keras.layers.Dense(10, activation=tf.nn.softmax, name='Softmax')\n",
    "  ])\n",
    "  model.summary()\n",
    "\n",
    "  model.compile(optimizer=tf.train.AdamOptimizer(),\n",
    "                loss='sparse_categorical_crossentropy',\n",
    "                metrics=['accuracy'])\n",
    "\n",
    "def export_model(model, model_export_path):\n",
    "  version = 1\n",
    "  export_path = os.path.join(model_export_path, str(version))\n",
    "\n",
    "  tf.saved_model.simple_save(\n",
    "    keras.backend.get_session(),\n",
    "    export_path,\n",
    "    inputs={'input_image': model.input},\n",
    "    outputs={t.name:t for t in model.outputs})\n",
    "\n",
    "  print('\\nSaved model: {}'.format(export_path))\n",
    "\n",
    "\n",
    "def main(model_export_path=None, model_summary_path=None, epochs=5):\n",
    "  \"\"\"Fashion MNIST Tensorflow Example.\n",
    "    Args:\n",
    "      model_summary_path: Model export path.\n",
    "      model_summary_path: Model summry files for Tensorboard visualization\n",
    "      epochs: Training epochs. \n",
    "    \"\"\"\n",
    "\n",
    "  # Setting up metadata tracking\n",
    "  mnist_workspace = metadata.Workspace(\n",
    "    # Connect to metadata-service in namesapce kubeflow in k8s cluster.\n",
    "    backend_url_prefix=\"metadata-service.kubeflow:8080\",\n",
    "    name=\"mnist\",\n",
    "    description=\"Mnist image classification\",\n",
    "    labels={\"env\": \"develop\"})\n",
    "\n",
    "  mnist_run = metadata.Run(\n",
    "    workspace=mnist_workspace,\n",
    "    name=\"run-\" + datetime.utcnow().isoformat(\"T\") ,\n",
    "    description=\"a run in mnist workspace\",\n",
    "  )\n",
    "\n",
    "  mnist_execution = metadata.Execution(\n",
    "    name = \"execution\" + datetime.utcnow().isoformat(\"T\") ,\n",
    "    workspace=mnist_workspace,\n",
    "    run=mnist_run,\n",
    "    description=\"execution example in mnist run\",\n",
    "  )\n",
    "\n",
    "  start_time = time.time()\n",
    "  train_images, train_labels, test_images, test_labels = preprocessing(mnist_execution)\n",
    "  model = train(train_images, train_labels, epochs, model_summary_path, mnist_execution)\n",
    "\n",
    "  dataset = mnist_execution.log_input(\n",
    "      metadata.DataSet(\n",
    "            description=\"MNIST Training datasets\",\n",
    "            name=\"mnist\",\n",
    "            owner=\"someone@kubeflow.org\",\n",
    "            uri=\"s3://path/to/dataset/mnist\",\n",
    "            version=\"v1.0.0\",\n",
    "            query=\"SELECT * FROM mytable\"))\n",
    "  print(\"data set id is %s\" % dataset.id)\n",
    "\n",
    "  if model_export_path:\n",
    "    export_model(model, model_export_path)\n",
    "\n",
    "  metadata_model = mnist_execution.log_output(\n",
    "      metadata.Model(\n",
    "        name=\"MNIST\",\n",
    "        description=\"model to recognize handwritten digits\",\n",
    "        owner=\"someone@kubeflow.org\",\n",
    "        uri=model_export_path,\n",
    "        model_type=\"neural network\",\n",
    "        training_framework={\n",
    "            \"name\": \"tensorflow\",\n",
    "            \"version\": \"v1.0\"\n",
    "        },\n",
    "        hyperparameters={\n",
    "            \"learning_rate\": 0.5,\n",
    "            \"layers\": [10, 3, 1],\n",
    "            \"early_stop\": True\n",
    "        },\n",
    "        version=\"v0.0.1\",\n",
    "        labels={\"mylabel\": \"l1\"}))\n",
    "  print(\"model id is %s\" % metadata_model.id)\n",
    "\n",
    "  metrics = mnist_execution.log_output(\n",
    "    metadata.Metrics(\n",
    "            name=\"MNIST-evaluation\",\n",
    "            description=\"validating the MNIST model to recognize handwritten digits\",\n",
    "            owner=\"someone@kubeflow.org\",\n",
    "            uri=\"s3://my-bucket/mnist-eval.csv\",\n",
    "            data_set_id=dataset.id,\n",
    "            model_id=metadata_model.id,\n",
    "            metrics_type=metadata.Metrics.VALIDATION,\n",
    "            values={\"accuracy\": 0.95},\n",
    "            labels={\"mylabel\": \"l1\"}))\n",
    "\n",
    "  # Measure running time\n",
    "  duration_in_seconds = time.time() - start_time\n",
    "  print(\"This model took\", duration_in_seconds, \"seconds to train and test.\")\n",
    "  mnist_execution.log_output(\n",
    "      metadata.Metrics(\n",
    "              name=\"MNIST-evaluation\",\n",
    "              description=\"validating the MNIST model to recognize handwritten digits\",\n",
    "              owner=\"someone@kubeflow.org\",\n",
    "              uri=\"s3://my-bucket/mnist-eval.csv\",\n",
    "              data_set_id=dataset.id,\n",
    "              model_id=metadata_model.id,\n",
    "              metrics_type=metadata.Metrics.VALIDATION,\n",
    "              values={\"time\": duration_in_seconds},\n",
    "              labels={\"mylabel\": \"l1\"}))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "main()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Navigate to the Kubeflow Artifact Store"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can go to central dashboard -> Artifact Store to check details.\n",
    "![artifact-store](images/artifact_store.jpg)\n",
    "\n",
    "You can click name and check details.\n",
    "![artifact-mnist](images/artifacts_mnist.jpg)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
